Add Buildkite pipeline for AI E2E tests (simulator-llm-pilot gem) by iangmaia · Pull Request #25444 · wordpress-mobile/WordPress-iOS

iangmaia · 2026-03-24T13:49:48Z

Summary

- Adds a Buildkite command script and pipeline step for running AI E2E tests using the simulator-llm-pilot gem
- Checks for the "Testing" label on the PR (skips the run if it is missing, to save CI resources)
- Downloads build artifacts, installs the app on the simulator, installs the gem from GitHub, and runs the tests
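
The label gate described above could look roughly like the following. This is a minimal sketch, not the script from this PR: `has_label` is a hypothetical helper, and the only thing taken from the PR is that `BUILDKITE_PULL_REQUEST_LABELS` is a comma-separated list.

```shell
#!/usr/bin/env bash
# Sketch only: has_label is a hypothetical helper, not code from this PR.

# Return 0 if the comma-separated list in $1 contains the exact label $2.
has_label() {
  local labels="$1" wanted="$2" label
  local -a items
  IFS=',' read -r -a items <<< "$labels"
  for label in "${items[@]}"; do
    # Trim surrounding whitespace so "a, b" and "a,b" behave the same.
    label="${label#"${label%%[![:space:]]*}"}"
    label="${label%"${label##*[![:space:]]}"}"
    [[ "$label" == "$wanted" ]] && return 0
  done
  return 1
}

if has_label "${BUILDKITE_PULL_REQUEST_LABELS:-}" "Testing"; then
  echo "Testing label found, running AI E2E tests."
else
  # A real CI step would likely 'exit 0' here so the skipped step stays green.
  echo "No Testing label on this PR, skipping AI E2E tests."
fi
```

Matching whole list items rather than substrings means a label such as "Unit Testing" would not trigger the run.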

The gem handles everything internally: simulator detection, WDA lifecycle, agent loop with sandboxed tools, context window compression, verification/cleanup enforcement, and structured results.

Alternative approach: see #25443 for a Claude Code + wrapper scripts version of the same pipeline.

Ref: AINFRA-2176

Test plan

- Run .buildkite/commands/run-ai-e2e-tests.sh locally with a booted simulator and test site credentials
- Run a simple test case (users-screen-loads.md) end-to-end
- Verify results.md is written with the correct pass/fail status

🤖 Generated with Claude Code

dangermattic · 2026-03-24T13:50:19Z

	1 Message
📖	This PR is still a Draft: some checks will be skipped.

Generated by 🚫 Danger

wpmobilebot · 2026-03-24T14:01:35Z

📲 You can test the changes from this Pull Request in WordPress by scanning the QR code below to install the corresponding build.

	App Name	WordPress
	Configuration	Release-Alpha
	Build Number	`32185`
	Version	`PR #25444`
	Bundle ID	`org.wordpress.alpha`
	Commit	`146daa1`
	Installation URL	7l99o9dpqifu8

Automatticians: You can use our internal self-serve MC tool to give yourself access to those builds if needed.

wpmobilebot · 2026-03-24T14:01:49Z

📲 You can test the changes from this Pull Request in Jetpack by scanning the QR code below to install the corresponding build.

	App Name	Jetpack
	Configuration	Release-Alpha
	Build Number	`32185`
	Version	`PR #25444`
	Bundle ID	`com.jetpack.alpha`
	Commit	`146daa1`
	Installation URL	04hfgpqarjii0

Automatticians: You can use our internal self-serve MC tool to give yourself access to those builds if needed.

sonarqubecloud · 2026-03-31T17:34:00Z

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

crazytonyli · 2026-05-08T03:35:44Z

Hi @iangmaia, shall we land this and start running nightly jobs?

The gem provides a sandboxed agent that drives the simulator through a fixed set of tools (tap, swipe, type, REST API) with no arbitrary code execution. It handles WDA lifecycle, session management, context window compression, and verification/cleanup enforcement internally.

The Buildkite step:
- Checks for "Testing" label (skips if missing)
- Downloads build artifacts and installs app on simulator
- Installs the simulator-llm-pilot gem from GitHub
- Runs all test cases in Tests/AgentTests/ui-tests/

Ref: AINFRA-2176

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The gem no longer hardcodes the WordPress login flow in its system prompt. Add app-instructions.md with the WordPress/Jetpack login flow and pass it via --app-instructions-file. Also pass --app-name so the LLM knows the app's display name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

iangmaia · 2026-05-08T15:21:07Z

@crazytonyli Hey Tony! With all the recent changes and updates this got left behind 😓 sorry about that. There's not much work left to start running it and iterating on the tests IMO, so that's the good side.
There are a couple of open questions in paaHJt-9Te-p2 related to the tests themselves (one of them always failed iinm).

As mentioned in the P2, I think that this PR + simulator-llm-pilot is the way to go for E2E AI tests, but it would be nice to make it fully 🟢 to start with.

Copilot

Pull request overview

Adds a Buildkite CI step to run AI-driven end-to-end UI tests on an iOS Simulator using the simulator-llm-pilot gem, including helper scripts for installing the gem, locating/booting a simulator, and building WebDriverAgent.

Changes:

- Adds a new Buildkite pipeline step (PR-only) to run AI E2E tests and upload Tests/AgentTests/results/** artifacts.
- Introduces CI scripts to install simulator-llm-pilot, find a booted simulator, build WebDriverAgent, install the app, and run the test suite.
- Updates AI test/navigation skill docs and adds app login instructions used by the test runner.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `Tests/AgentTests/app-instructions.md` | Adds login-flow instructions for the agent-runner to avoid unsafe/manual credential entry. |
| `Scripts/ci/install-simulator-llm-pilot.sh` | Installs `simulator-llm-pilot` by building from a local checkout or cloning from GitHub. |
| `Scripts/ci/find-booted-simulator.rb` | Helper to return a booted simulator UDID (optionally waiting/polling). |
| `.claude/skills/ios-sim-navigation/SKILL.md` | Aligns documentation placeholder naming (`<APP_BUNDLE_ID>`). |
| `.claude/skills/ai-test-runner/SKILL.md` | Aligns documentation placeholder naming (`<APP_BUNDLE_ID>`). |
| `.buildkite/pipeline.yml` | Adds a new “AI E2E Tests” Buildkite step gated to PR builds. |
| `.buildkite/commands/run-ai-e2e-tests.sh` | Orchestrates artifact download, simulator/app setup, WDA build, and `simulator-llm-pilot run`. |
| `.buildkite/commands/build-wda.sh` | Clones/builds WebDriverAgent and skips rebuild when artifacts already exist. |


+SIMULATOR_LLM_PILOT_REPO_URL="${SIMULATOR_LLM_PILOT_REPO_URL:-https://github.com/Automattic/simulator-llm-pilot.git}"
+SIMULATOR_LLM_PILOT_SOURCE_PATH="${SIMULATOR_LLM_PILOT_SOURCE_PATH:-}"
+
+build_dir="$(mktemp -d)"
+trap 'rm -rf "$build_dir"' EXIT
+
+source_path="${SIMULATOR_LLM_PILOT_SOURCE_PATH}"
+if [[ -z "$source_path" && -f "${DEFAULT_LOCAL_GEM_PATH}/simulator-llm-pilot.gemspec" ]]; then
+  source_path="${DEFAULT_LOCAL_GEM_PATH}"
+fi
+
+if [[ -n "$source_path" ]]; then
+  echo "Using local simulator-llm-pilot source at ${source_path}"
+  if [[ -d "${source_path}/.git" ]]; then
+    source_revision="$(git -C "${source_path}" rev-parse HEAD)"
+    git -C "${source_path}" archive HEAD | tar -x -C "$build_dir"
+  else
+    source_revision="local-filesystem"
+    tar -cf - -C "${source_path}" . | tar -xf - -C "$build_dir"
+  fi
+else
+  echo "Cloning simulator-llm-pilot from ${SIMULATOR_LLM_PILOT_REPO_URL}"
+  git clone --depth 1 "${SIMULATOR_LLM_PILOT_REPO_URL}" "$build_dir"
+  source_revision="$(git -C "$build_dir" rev-parse HEAD)"
+fi
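
Once the source is staged in `$build_dir`, the commit "Fix gem build: run from inside cloned directory" suggests the build has to run with that directory as cwd. One way to do that without changing the caller's working directory is a subshell. This is a sketch under that assumption; `run_in_dir` is a hypothetical helper and is not part of the script shown here.

```shell
#!/usr/bin/env bash
# Sketch only: run_in_dir is a hypothetical helper, not code from this PR.

# Run a command with $1 as its working directory. The subshell keeps the
# 'cd' from leaking into the calling script.
run_in_dir() {
  local dir="$1"
  shift
  ( cd "$dir" && "$@" )
}

# Possible usage in the install script (gemspec name taken from the snippet above):
#   run_in_dir "$build_dir" gem build simulator-llm-pilot.gemspec
```

Because relative paths in the gemspec resolve against cwd, running the build this way should let entries such as `bin/simulator-llm-pilot` be found regardless of where the CI step was started.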


+WEBDRIVERAGENT_REPO_URL="${WEBDRIVERAGENT_REPO_URL:-https://github.com/appium/WebDriverAgent.git}"
+WEBDRIVERAGENT_REF="${WEBDRIVERAGENT_REF:-}"


+ensure_wda_checkout
+
+if [[ -d "$WDA_PROJECT" ]] && has_built_artifacts; then
+  echo "WebDriverAgent already built, skipping."
+  exit 0
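
The body of `has_built_artifacts` is not shown in this excerpt. As an illustration only, a check of this kind could look for the `.xctestrun` file that `xcodebuild build-for-testing` writes into its build products. The helper name `has_xctestrun` and the directory argument are assumptions, not the PR's implementation.

```shell
#!/usr/bin/env bash
# Sketch only: has_xctestrun is a hypothetical stand-in for has_built_artifacts.
# Assumption: a finished build-for-testing run leaves at least one *.xctestrun
# file somewhere under the given derived-data directory.
has_xctestrun() {
  local derived_data="$1"
  [[ -d "$derived_data" ]] || return 1
  # Stop after the first match; only existence matters here.
  [[ -n "$(find "$derived_data" -type f -name '*.xctestrun' 2>/dev/null | head -n 1)" ]]
}

# Possible usage (directory path is illustrative):
#   if has_xctestrun "$WDA_DERIVED_DATA"; then echo "WebDriverAgent already built, skipping."; fi
```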


+TIMESTAMP="$(date +%Y-%m-%d-%H%M)"
+RESULTS_DIR="Tests/AgentTests/results/${TIMESTAMP}"
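
For illustration, the same timestamp scheme can be wrapped in a small function that also creates the directory. `make_results_dir` is a hypothetical helper; only the `date` format and the `Tests/AgentTests/results` root come from the snippet above.

```shell
#!/usr/bin/env bash
# Sketch only: make_results_dir is a hypothetical helper, not code from this PR.

# Create "<root>/<YYYY-MM-DD-HHMM>" and print its path.
make_results_dir() {
  local root="$1" stamp dir
  stamp="$(date +%Y-%m-%d-%H%M)"
  dir="${root}/${stamp}"
  # -p creates missing parents and succeeds if the directory already exists.
  mkdir -p "$dir" || return 1
  printf '%s\n' "$dir"
}

# Possible usage:
#   RESULTS_DIR="$(make_results_dir Tests/AgentTests/results)"
```

With minute resolution, two runs started in the same minute on the same machine would share a directory. That is probably fine for one job per CI agent, but it is worth keeping in mind for local runs.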


+UDID="$(ruby Scripts/ci/find-booted-simulator.rb "$SIMULATOR_NAME" 2>/dev/null || true)"
+if [[ -z "$UDID" ]]; then
+  echo "No booted simulator named '$SIMULATOR_NAME' found. Booting..."
+  xcrun simctl boot "$SIMULATOR_NAME" 2>/dev/null || true
+  UDID="$(ruby Scripts/ci/find-booted-simulator.rb "$SIMULATOR_NAME" 30 1 2>/dev/null || true)"
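
The second call passes `30 1`, which presumably means "wait up to 30 seconds, polling every second" (an assumption based on the reviewer's "optionally waiting/polling" description). The same retry pattern can be sketched in plain shell. `wait_for_output` is a hypothetical helper and does not appear in this PR.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_output is a hypothetical helper, not code from this PR.

# Retry a command up to $1 times, sleeping $2 seconds between attempts,
# until it prints something. Prints that output and returns 0 on success,
# returns 1 if every attempt produced no output.
wait_for_output() {
  local attempts="$1" interval="$2"
  shift 2
  local i out
  for (( i = 0; i < attempts; i++ )); do
    # Errors from the probed command are treated the same as "nothing yet".
    out="$("$@" 2>/dev/null || true)"
    if [[ -n "$out" ]]; then
      printf '%s\n' "$out"
      return 0
    fi
    sleep "$interval"
  done
  return 1
}

# Possible usage:
#   UDID="$(wait_for_output 30 1 ruby Scripts/ci/find-booted-simulator.rb "$SIMULATOR_NAME")"
```

Returning a non-zero status when the wait expires lets the caller fail the step with a clear message instead of continuing with an empty UDID.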


+  output, status = Open3.capture2('xcrun', 'simctl', 'list', 'devices', 'booted', '-j')
+  exit 1 unless status.success?


iangmaia added the [Status] DO NOT MERGE label Mar 24, 2026

iangmaia self-assigned this Mar 24, 2026

iangmaia added the Testing Unit and UI Tests and Tooling label Mar 25, 2026

iangmaia force-pushed the iangmaia/ci-ai-e2e-tests-gem branch 2 times, most recently from 1602fa9 to 8589139 Compare March 30, 2026 17:25

iangmaia and others added 12 commits May 8, 2026 17:16

Use [[ instead of [ for conditional tests

0cce295

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix label check: BUILDKITE_PULL_REQUEST_LABELS is comma-separated

eb15a6e

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix gem build: run from inside cloned directory

9550f6a

gem build resolves spec file paths relative to cwd, so bin/simulator-llm-pilot wasn't found when building from the wordpress-ios repo root.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Clone and build WebDriverAgent if not present on CI agent

d7d0003

Extract WDA build to a separate build-wda.sh script for clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Export SIMULATOR_NAME so build-wda.sh can read it

0edbbf3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Harden gem-backed AI E2E runner

e253eef

Use APP_BUNDLE_ID consistently

c7c5281

Normalize CI site URLs for simulator runs

8fda202

Extend AI E2E timeout to 60 minutes

d8caf46

Fix Rubocop errors

8e60354

iangmaia force-pushed the iangmaia/ci-ai-e2e-tests-gem branch from efadbc8 to 146daa1 Compare May 8, 2026 15:16

iangmaia marked this pull request as ready for review May 8, 2026 15:17

Copilot AI review requested due to automatic review settings May 8, 2026 15:18

Copilot started reviewing on behalf of iangmaia May 8, 2026 15:18 View session

iangmaia requested review from crazytonyli, mokagio and twstokes May 8, 2026 15:21

Copilot AI reviewed May 8, 2026

iangmaia wants to merge 12 commits into trunk from iangmaia/ci-ai-e2e-tests-gem

5 participants
